OPEN ACCESS Freely available online

PLOS ONE



Mining Gene Expression Data of Multiple Sclerosis 

Pi Guo, Qin Zhang, Zhenli Zhu, Zhengliang Huang, Ke Li*

1 Department of Public Health, Shantou University Medical College, Shantou City, Guangdong Province, China, 2 Good Clinical Practice Office, Cancer Hospital of Shantou
University Medical College, Shantou City, Guangdong Province, China, 3 Laboratory of Cell Senescence, Shantou University Medical College, Shantou City, Guangdong
Province, China



Abstract 

Objectives: Microarray produces a large amount of gene expression data, containing various biological implications. The 
challenge is to detect a panel of discriminative genes associated with disease. This study proposed a robust classification 
model for gene selection using gene expression data, and performed an analysis to identify disease-related genes using 
multiple sclerosis as an example. 

Materials and methods: Gene expression profiles based on the transcriptome of peripheral blood mononuclear cells from a 
total of 44 samples from 26 multiple sclerosis patients and 18 individuals with other neurological diseases (control) were 
analyzed. Feature selection algorithms including Support Vector Machine based on Recursive Feature Elimination, Receiver 
Operating Characteristic Curve, and Boruta algorithms were jointly performed to select candidate genes associated with
multiple sclerosis. Multiple classification models categorized samples into two different groups based on the identified
genes. Models' performance was evaluated using cross-validation methods, and an optimal classifier for gene selection was 
determined. 

Results: An overlapping feature set was identified consisting of 8 genes that were differentially expressed between the two
phenotype groups. The genes were significantly associated with the pathways of apoptosis and cytokine-cytokine receptor 
interaction. TNFSF10 was significantly associated with multiple sclerosis. A Support Vector Machine model was established 
based on the featured genes and gave a practical accuracy of ~86%. This binary classification model also outperformed the 
other models in terms of Sensitivity, Specificity and F1 score.

Conclusions: The combined analytical framework integrating feature ranking algorithms and Support Vector Machine 
model could be used for selecting genes for other diseases. 

Introduction

As powerful tools for facilitating the discovery of totally novel 
and unexpected functional roles of genes, gene expression 
microarrays have been applied to a range of applications in 
biomedical research and produce a large number of databanks 
containing various amounts of hidden biological information [1]. 
The key resides in the ability to analyze large amounts of data to 
detect a panel of genes capable of discriminating diseases. This 
study proposed a modeling framework for establishing a robust 
classification model, for identification of disease-related genes. We 
utilized the proposed modeling approach for identification of 
genes involved in multiple sclerosis. 

Multiple sclerosis is characterized as an inflammatory disorder 
of the central nervous system in which focal lymphocytic 
infiltration leads to damage of myelin and axons [2]. The trigger
for multiple sclerosis remains unclear, although it is generally
regarded as an autoimmune disease [3]. At present, the diagnosis
of multiple sclerosis usually involves lumbar puncture
or a magnetic resonance imaging scan of the brain. These
diagnostic methods are either clinically invasive or expensive for
multiple sclerosis patients. The high-throughput microarray
technique has been applied to measure gene expression patterns in
multiple sclerosis, and the challenge is to develop more effective
approaches that identify a panel of genes going beyond simply over-
or under-expressed genes in these large datasets. In this study we
reanalyzed the microarray dataset of multiple sclerosis from 
Brynedal et al. [4] using data mining methods, and selected 
discriminative genes. The computationally intensive methods of 
data mining provide us an effective way to rank features, allowing 
a careful selection of feature sets for optimal classification fitting. 
Therefore, we were able to investigate some genes with potential 
biological implications from microarray data. The aim of this 
study was to build a robust classification model with characteristics 
of feature selection and sample prediction. 

Prior studies showed that combinatorial gene selection methods
could be effectively applied to identify the gene signature for
disease [5]. Zhou et al. [6] conducted a union method combining
two feature selection algorithms, and identified significant risk
factors for osteoporosis from a very large number of candidates.
This work introduced a combinational strategy to predict multiple
sclerosis samples using microarray data. In the initial stage, a
feature selection algorithm was used to extract the biologically
interpretable genes. A combined approach integrating three
feature selection algorithms including Support Vector Machine
based on Recursive Feature Elimination (SVM-RFE) [7], Receiver
Operating Characteristic (ROC) Curve [8], and Boruta [9] was
performed to rank genes, and order genes based on their
importance. Then, an overlapping set of genes was selected. The
SVM-RFE algorithm can eliminate gene redundancy automati- 
cally, retain a better and more compact gene subset, and yield a 
better classification performance. The ROC algorithm
characterizes the best separation between the distributions of the two
groups, and is easy to implement. The Boruta algorithm measures
the importance of each feature. These three feature selection 
algorithms had high performance in learning, and their outputs 
were easy to understand. 

We constructed six classical models including SVM, Random 
Forests, naive Bayes, Artificial Neural Network, Logistic Regres- 
sion and k-Nearest Neighbor to predict samples based on the 
feature subset. These models are widely employed in gene 
classification and have practical predicting performance. We 
introduced these techniques to classify the samples, evaluated them 
using cross-validation methods, and then utilized the optimal 
model to construct a gene selection model. As evaluated by several
statistical metrics, an optimal SVM model was proposed, and it
was shown to be useful for selecting disease-related genes in
multiple sclerosis.

Materials and Methods 

The process of data collection and analysis is illustrated in 
Figure 1, and the details of each step can be found in the following 
subsections. 

Data Collection and Processing 

Gene expression profiles for a total of 44 subjects were obtained
from the ArrayExpress database under accession number
E-MTAB-69. Global gene expression in peripheral
blood mononuclear cell samples was assessed in 26 multiple
sclerosis patients. As the control, a population consisting of 18
individuals with other neurological diseases was also examined, so
that expression profiles specific to multiple sclerosis could be
assessed. The transcripts of peripheral blood lymphocytes were hybridized
individually to Human Genome U133 Plus 2.0 arrays
(Affymetrix, Santa Clara, CA) according to standard
operating protocols. A full description of experimental protocols
and processes can be found in the study conducted by Brynedal et
al. [4].

Figure 1. Flow chart of data analysis. The four major steps of this
study: data preprocessing, feature selection, model building, and
performance validation.
doi:10.1371/journal.pone.0100052.g001

The raw fluorescence intensity data were converted to gene
expression values using the Robust Multichip Analysis algorithm
[10] in Expression Console™ Software. Each expression profile,
containing 54,675 probe sets, was preprocessed by
background correction, normalization and summarization of the
intensities for each sample [10]. Probes with less discriminative
power were removed according to their overall
variance, using the varFilter function from
the genefilter package of the Bioconductor project [11] within
R software [12]. After the preprocessing, a total of 27,336 probe
sets from each sample were used for further analysis.

Feature Selection 

SVM-RFE algorithm. The idea of the SVM-RFE algorithm
is to use the weight magnitude of the SVM classifier as a feature
ranking criterion to produce a ranked feature list [7]. The SVM-RFE
algorithm iterates over the following three steps:

a) Train the SVM;

b) Compute the ranking criterion $(w_i)^2$ for all features based on
the weight vector $w$;

c) Remove the feature with the smallest ranking criterion.

When all iterations have finished, a ranked feature
list $F = [f_1, f_2, \ldots, f_k, \ldots, f_n]$ is obtained according to the
evaluations of the features.
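
The three steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' R implementation: the linear SVM is fitted here by plain batch subgradient descent without a bias term, the hyperparameters `lam`, `lr` and `epochs` are arbitrary assumptions, labels are coded as -1/+1 rather than 0/1, and the toy data are hypothetical (feature 0 separates the classes strongly, feature 1 is a weaker copy of it, and feature 2 is constant).

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear soft-margin SVM (no bias term) by batch subgradient descent.

    X: list of samples (lists of floats); y: labels in {-1, +1}.
    Returns the weight vector w.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]            # gradient of (lam/2) * ||w||^2
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:                       # hinge loss is active for this sample
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)           # step a) train the SVM
        scores = [wj ** 2 for wj in w]           # step b) ranking criterion (w_i)^2
        worst = scores.index(min(scores))        # step c) drop the smallest criterion
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]                      # last eliminated = most important


# Hypothetical toy data: 4 samples, 3 features.
X_toy = [[2.0, 0.2, 0.0], [1.5, 0.15, 0.0], [-2.0, -0.2, 0.0], [-1.0, -0.1, 0.0]]
y_toy = [1, 1, -1, -1]
print(svm_rfe(X_toy, y_toy))  # [0, 1, 2]
```

Removing one feature per iteration, as here, follows the description in the text; with tens of thousands of probe sets, practical implementations often remove a fraction of the features per round to save time.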

ROC algorithm. The ROC curve is a particularly suitable 
and effective method to rank genes in regards to differential 
expression between tissues [8]. Suppose that $Y_g^{D}$ and $Y_g^{C}$
respectively represent the distributions of two phenotype groups
for gene $g$. The idea of the ROC algorithm is to characterize
separations and find a best one between the distributions for $Y_g^{D}$
and $Y_g^{C}$.

Then, the partial area under the curve pAUC(to) and the area 
under the entire curve AUC are commonly used to rank genes for 
differential expression in tissue samples. These two statistical 
measures are defined as equation (1) and (2): 



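$$\mathrm{pAUC}(t_0) = \int_0^{t_0} \mathrm{ROC}(t)\,dt \qquad (1)$$

$$\mathrm{AUC} = \int_0^{1} \mathrm{ROC}(t)\,dt \qquad (2)$$

where $t_0$ is some small false positive rate. Differentially expressed
genes can be ranked based on the results of $\mathrm{pAUC}(t_0)$ and AUC.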
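
As a rough illustration of equations (1) and (2), the sketch below builds the empirical ROC curve for a single gene and integrates it with the trapezoidal rule. The function names and the expression values are hypothetical, and the paper does not state how the integrals were evaluated numerically, so the trapezoidal rule is an assumption.

```python
def roc_points(cases, controls):
    """Empirical ROC curve as (false positive rate, true positive rate) pairs.

    A sample is called positive when its expression value is >= the threshold;
    the threshold sweeps from the largest observed value down to the smallest.
    """
    thresholds = sorted(set(cases) | set(controls), reverse=True)
    points = [(0.0, 0.0)]
    for c in thresholds:
        tpr = sum(v >= c for v in cases) / len(cases)
        fpr = sum(v >= c for v in controls) / len(controls)
        points.append((fpr, tpr))
    return points


def partial_auc(points, t0=1.0):
    """Trapezoidal area under the ROC curve for false positive rates in [0, t0].

    With t0 = 1 this is the full AUC of equation (2); smaller t0 gives
    the pAUC(t0) of equation (1).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= t0:
            break
        if x1 > t0:  # clip the last segment at t0 by linear interpolation
            y1 = y0 + (y1 - y0) * (t0 - x0) / (x1 - x0)
            x1 = t0
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical expression values of one gene in the two groups.
cases = [5, 6, 7, 8]
controls = [1, 2, 3, 6.5]
curve = roc_points(cases, controls)
print(partial_auc(curve))        # full AUC
print(partial_auc(curve, 0.25))  # pAUC at t0 = 0.25
```

For these toy values, 14 of the 16 case-control pairs are correctly ordered, so the full AUC equals 14/16 = 0.875, which matches the area returned by `partial_auc(curve)`.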

Boruta algorithm. The Boruta algorithm is designed to
iteratively remove the features which are proven to be less relevant
than random probes [9]. It is built on the random forest classification
algorithm, which usually runs fast without parameter tuning and gives
an estimate of feature importance [13]. Briefly, a random forest is an
ensemble of tree predictors in which each tree depends on independent,
identically distributed random vectors.

Figure 2. Analysis results of gene expression of the top 1000 genes selected from SVM-RFE (red star symbols) and ROC (black
star symbols) algorithms. Genes with log fold change (FC) of expression >2 and adjusted P-value <0.01 were in the upper right area.
doi:10.1371/journal.pone.0100052.g002

Figure 3. Overlapping features based on the ranked feature
sets generated by three algorithms. Model 1: Support Vector
Machine based on Recursive Feature Elimination (SVM-RFE) algorithm;
Model 2: Receiver Operating Characteristic (ROC) Curve algorithm;
Model 3: Boruta algorithm. In this procedure, an overlapping set,
including 8 features, was identified and used for gene matching.
doi:10.1371/journal.pone.0100052.g003

In the Boruta algorithm, shadow attributes for the original 
attributes are created by shuffling values of the original attributes
across objects, and thus the importance of shadow attributes is 
estimated and used as a reference for deciding truly important 
attributes. By adding randomness into the model system and 
correcting for the random fluctuations based on the ensemble of 
extra randomness, the Boruta algorithm aims to determine which 
attributes are truly important. The Boruta algorithm was 
implemented using the Boruta package [9] within R. 
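
The shadow-attribute idea can be illustrated with a short Python sketch. It covers only two pieces of the procedure described above: creating shuffled copies of the attributes, and flagging real attributes whose importance exceeds the best shadow importance in one iteration. The importance scores themselves would come from a random forest, which is not reproduced here; the function names, the toy matrix (rows are samples, columns are attributes) and the importance values are all hypothetical.

```python
import random


def make_shadow_features(matrix, seed=0):
    """Return a shuffled copy of `matrix` (rows = samples, columns = attributes).

    Each column is permuted independently across samples, which keeps its
    value distribution but breaks any link with the class labels.
    """
    rng = random.Random(seed)
    n_features = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_features)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def extend_with_shadows(matrix, seed=0):
    """Append one shadow column per original column to every sample."""
    shadows = make_shadow_features(matrix, seed)
    return [list(orig) + shad for orig, shad in zip(matrix, shadows)]


def confirmed_hits(importances, n_real):
    """Flag real attributes that beat the most important shadow attribute.

    `importances` lists the scores of the n_real original attributes first,
    followed by the scores of their shadows.
    """
    best_shadow = max(importances[n_real:])
    return [score > best_shadow for score in importances[:n_real]]


toy = [[1, 10], [2, 20], [3, 30], [4, 40]]       # hypothetical 4 samples x 2 attributes
print(extend_with_shadows(toy, seed=1))
print(confirmed_hits([0.9, 0.2, 0.5, 0.3], n_real=2))  # [True, False]
```

In the full algorithm this comparison is repeated over many random forest runs, and an attribute is labeled "important" only when it beats the shadows significantly more often than expected by chance.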

Data Encoding and Feature Selecting 

To encode microarray data to be fed into the feature selection
step, the gene expression values were used to construct a gene
expression matrix M, composed of 27,336 rows
representing the probe sets in each gene expression profile and 44
columns representing samples. A vector T was generated to
represent the group status of each sample, with "0" denoting the
"control group" and "1" denoting the "multiple sclerosis group".
Then, the matrix M and the vector T were input into the feature
selection algorithms, which iteratively evaluated a candidate
subset of features using the grouping information of samples, and
generated a satisfactory feature subset. Two different kinds
of feature selection algorithms were used in this study: the SVM-RFE
and ROC algorithms rank genes in order, whereas the Boruta algorithm
directly generates a subset of genes labeled "important". We
therefore selected the top 1,000 results from each of the SVM-RFE
and ROC algorithms, and the genes labeled
"important" from the Boruta algorithm. In fact, the
Boruta algorithm only generated a subset of significant genes
from a pool of candidates. To determine the significant category
of genes in the SVM-RFE and ROC algorithms, as done by the
Boruta algorithm, the moderated t-test was applied to test the
statistical significance of the top-ranked genes in the two
algorithms.

Figure 4. Receiver operating characteristic (ROC) curves for evaluating identified features. AUC (area under the curve) and pAUC (partial
area under the curve) indicators were computed to assess the performance for each feature.
doi:10.1371/journal.pone.0100052.g004

Figure 5. Scatter plot of expression values of eight features.
Each panel in the plot corresponds to one probe set. The y-axis
represents the logarithmic expression intensity of each probe set, and
the x-axis represents the samples. The red and blue colors respectively
represent the multiple sclerosis and control groups.
doi:10.1371/journal.pone.0100052.g005

Gene Function Analysis 

Initially, each probe set from the feature selection algorithms was
mapped to an annotation of Entrez Gene ID and full gene name
using the GeneCards Human Gene Database (http://www.genecards.org/),
an integrated system that provides concise
genomic information on all known and predicted human
genes. GeneCards also gives the counts of previously reported
studies as an indicator of the strength of association between genes and
potential diseases. We submitted gene symbols to GeneCards
and evaluated these associations. In the next phase, the
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway
enrichment of the identified genes was assessed using the
Gene Annotation Tool to Help Explain Relationships
(GATHER) (http://gather.genome.duke.edu/). GATHER is a
bioinformatics tool that integrates various available
data to extract the full value from molecular signatures produced
by high-throughput assays [14]. We also performed a Gene
Ontology enrichment analysis of genes based on GATHER. The
GATHER system annotates genes with functional descriptors
from Gene Ontology, and quantifies the significance of functional
associations with a group of genes. The significance of association
between a gene group and an annotation was assessed using a
Bayes factor; a larger Bayes factor indicates a stronger
functional association [14]. The P-values were
corrected for multiple testing using the Bonferroni method in
this study. The limma package in R software was used to perform
the moderated t-test to investigate the differential expression of
the selected probes between multiple sclerosis patients and
controls.

Classification Models Building and Assessing 

Three feature selection algorithms were used to rank
genes according to the algorithms' scores. Each gene was ranked
based on its prediction performance in each algorithm. Three
ranked gene sets were thus generated, and an
overlapping gene set was finally determined. Multiple classification
models including SVM, Random Forests, naive Bayes, Artificial
Neural Network, Logistic Regression and k-Nearest Neighbor
were established using the MLInterfaces package in R
software. The 10-fold cross-validation method was performed to
assess the prediction accuracy of each classifier. 10-fold
cross-validation is an effective method to evaluate the performance of
classification models [15]. The principle of this approach is to
randomly partition the original sample into ten subsamples. Of
these subsamples, a single subsample is retained as the
validation dataset for testing the model, and the remaining
subsamples are used as training data. The process is repeated 10
times, with each subsample used exactly once for validation, and the
results are averaged to produce a final estimate of
performance.
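
The partitioning logic of 10-fold cross-validation can be sketched as follows. The paper ran this step through the MLInterfaces package in R; the Python below is only an illustration, and the names `k_fold_indices`, `cross_validate` and the callback `fit_and_score` are hypothetical. The callback stands for any routine that trains a classifier on the training indices and returns a score on the held-out indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition sample indices 0..n_samples-1 into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by `fit_and_score(train_idx, test_idx)` over k folds.

    Each fold serves as the validation set exactly once, while the
    remaining k-1 folds form the training set.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


# With 44 samples and k = 10, four folds hold 5 samples and six folds hold 4.
folds = k_fold_indices(44)
print(sorted(len(fold) for fold in folds))
```

Because 44 is not divisible by 10, the folds are slightly unequal in size; a stratified split that preserves the 26:18 class ratio in each fold would be a reasonable refinement, although the paper does not say whether stratification was used.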

Table 1. Annotations of gene symbol and full gene name for each selected probe set and the differential expression analysis using
moderated t-test.

Probe set ID  Gene symbol  Gene name                                              logFC   t        P-value   Adjusted P-value
1559949_at    TRPS1        Trichorhinophalangeal syndrome 1                       6.2601  18.9200  4.95E-23  5.16E-23
214329_x_at   TNFSF10      Tumor necrosis factor (ligand) superfamily, member 10  8.7780  26.8322  2.67E-29  3.37E-29
217782_s_at   GPS1         G protein pathway suppressor 1                         6.5087  27.9268  4.90E-30  6.44E-30
219284_at     Hspbap1      HSPB (heat shock 27 kDa) associated protein 1          8.8694  39.2049  2.14E-36  5.23E-36
225823_at     C19orf70     Chromosome 19 open reading frame 70                    7.5612  27.2640  1.36E-29  1.74E-29
230214_at     MRVI1        Murine retrovirus integration site 1 homolog           5.6772  27.4084  1.09E-29  1.40E-29
237588_at     SMCHD1       Structural maintenance of chromosomes flexible hinge   5.8131  26.7462  3.06E-29  3.85E-29
                           domain containing 1
1556735_at    Unknown      Unknown                                                6.6352  25.8057  1.39E-28  1.69E-28

doi:10.1371/journal.pone.0100052.t001

In a classification model, each sample was assigned to one of
the two groups, i.e. multiple sclerosis subjects or controls. We
applied the statistical measures of Sensitivity, Specificity, Accuracy
and F1 score [16] for performance evaluation. The measures were
defined as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP}$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP is the number of true positives, FN is the number of false
negatives, TN is the number of true negatives and FP is the
number of false positives.
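
As a worked example, the four measures can be computed from confusion-matrix counts using their standard definitions (Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP), Accuracy = (TP+TN)/total, F1 = 2TP/(2TP+FP+FN)). The counts used below (TP = 24, FN = 2, TN = 14, FP = 4) are not reported in the paper; they are inferred here from the group sizes (26 patients, 18 controls) and the rates given for the SVM model in Table 3, and should be read as an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, Specificity, Accuracy and F1 score from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),        # harmonic mean of precision and sensitivity
    }


# Inferred (not reported) counts for the SVM classifier: 26 patients, 18 controls.
svm = classification_metrics(tp=24, fn=2, tn=14, fp=4)
for name, value in svm.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these counts give 0.9231, 0.7778, 0.8636 and 0.8889, which agree with the Sensitivity, Specificity, Accuracy and F1 score listed for SVM in Table 3.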

Results 

Ranking Genes of Multiple Sclerosis 

The SVM-RFE, ROC and Boruta algorithms were performed
to rank all 27,336 probe sets measured in each participating subject.
Because the Boruta algorithm outputs a subset of probe sets with the
"important" label, we chose the "important" probe sets and
ordered them by the Z-score, which indicates the
measure of feature importance. In contrast, the SVM-RFE and
ROC algorithms directly ranked the probe sets in a sequence.
To determine the significant category of genes in the SVM-RFE
and ROC algorithms, as done by the Boruta algorithm, the moderated
t-test was applied to obtain the significant genes (Figure 2). Based
on the adjusted P-values of the moderated t-test, the top
1,000 genes from both algorithms were all significant, and their log
fold changes in expression between the two groups were greater than 2
(Figure 2). Hence, the three important sets of genes were integrated
and their overlapping genes were investigated. A Venn diagram was used to
represent the intersection of the three sets. The
top significant genes from the SVM-RFE and ROC algorithms and
the important ones from the Boruta algorithm were used to determine
the overlapping genes (Figure 3). A total of 8 genes,
the top hits from all three algorithms, lay in the intersection
of the diagram, and the expression values of these genes for each
subject were used as the input variables in the classification
models.
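
The intersection step amounts to a set operation on the three selected probe lists. The sketch below is illustrative only: the function names are hypothetical, two probe IDs are borrowed from Table 1, and the remaining IDs (prefixed `hypothetical_`) are made up to show probes that fall outside the overlap.

```python
def top_k(ranked_probes, k):
    """Keep the first k entries of a list ranked from most to least important."""
    return ranked_probes[:k]


def overlapping_features(*feature_sets):
    """Return the probes present in every feature set, in sorted order."""
    common = set(feature_sets[0]).intersection(*feature_sets[1:])
    return sorted(common)


# Toy ranked lists; the paper kept the top 1,000 from SVM-RFE and ROC.
svm_rfe_ranked = ["214329_x_at", "217782_s_at", "hypothetical_A_at", "hypothetical_B_at"]
roc_ranked = ["217782_s_at", "214329_x_at", "hypothetical_B_at", "hypothetical_C_at"]
boruta_important = ["214329_x_at", "hypothetical_C_at", "217782_s_at"]

overlap = overlapping_features(
    top_k(svm_rfe_ranked, 3),
    top_k(roc_ranked, 3),
    boruta_important,
)
print(overlap)  # ['214329_x_at', '217782_s_at']
```

Taking an intersection rather than a union keeps only the probes on which all three algorithms agree, which is why the final set is much smaller than any individual list.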



Discriminative Ability of each Gene 

We visualized the expression profiles of the 8 genes in all 44 
samples using ROC curves to illustrate the discriminative power 
between the two classes of samples for each gene (Figure 4). The 
indicators including pAUC (partial area under the curve) and 
AUC (area under the curve), were computed to assess the 
performance for each feature. The AUC of the 8 genes
ranged from 0.711 (probe 217782_s_at) to 0.852 (probe
230214_at), and 6 of them had AUC>0.78. Both the AUC and
pAUC measures suggested that the features had good classification
performance.

A scatter plot for the 8 genes was also used to illustrate their 
discriminative power between the two classes of samples (Figure 5). 
Each panel in the plot corresponds to one feature gene, and 
different expression levels of these genes between the two groups 
can be observed. According to the scatterplot, these 8 genes clearly 
showed differential expression between multiple sclerosis patients 
and controls, supporting the ability of these genes to differentiate 
between individuals with and without multiple sclerosis. 

Gene Ontology and KEGG Pathway Enrichment Analysis 

Each selected probe set was mapped to an annotation of Entrez 
Gene ID and the full gene name using the GeneCards database 
(Table 1). These 8 selected probes showed significantly differential
expression between multiple sclerosis patients and controls (all
adjusted P-values<0.05), and their log fold changes were consistently
greater than 2 (Table 1). The KEGG enrichment analysis
(Table 2) revealed that the identified genes were closely related to
the apoptosis and cytokine-cytokine receptor interaction pathways (all
adjusted P-values<0.05). TNFSF10 was suggested to be potentially
associated with multiple sclerosis. In the Gene Ontology
enrichment analysis, differentially expressed genes in multiple
sclerosis subjects versus controls were mainly involved in protein kinase
cascades, inactivation of MAPK, regulation of signal transduction
and apoptosis (Table S1 in File S1). Differentially regulated genes
primarily included TNFSF10, GPS1 and TRPS1. The information
retrieved from GeneCards showed that there were six published
studies reporting on the relationship between TNFSF10 and
multiple sclerosis (Table S2 in File S1).

A Robust Gene Expression Profile Classifier 

As evaluated with a 10-fold cross-validation method on the
whole dataset, the SVM model had the best discriminative ability
with a predictive accuracy of around 86% (Table 3). The
predictive accuracy of SVM was higher than that of the other
models. In terms of the Sensitivity, Specificity and F1 score, SVM
also outperformed the other models, as has been observed when
compared with bio-inspired algorithms [17-20]. The Sensitivity,
Specificity and F1 score for SVM were approximately 92%, 78% and
89%, respectively. The R code for the feature selection algorithms
and classification model building is provided (Table S3 in File
S1).

Discussion 

At present, microarray technology is extensively used in 
biomedical research, and the data processing method is the key 
part for analyzing gene-chip results. Questions remain as to how to 
analytically deal with this type of data. The challenge is to detect a 
panel of discriminative genes from a large pool of candidate genes 
[21,22]. To analyze the microarray data, this study proposed to
integrate three feature ranking algorithms (SVM-RFE, ROC and
Boruta) as the core of a combined algorithm. The combined
algorithm generated an ordered gene set of
moderate size. This work established a classification model for gene
selection using multiple sclerosis gene expression data. The
distinction between the three feature selection algorithms and
the classification models was that the feature selection algorithms
were used to detect a group of discriminative genes from a large
number of candidates, reducing the dimensionality of the data sets,
whereas the models were built and assessed based on the selected genes
for sample predictions. In evaluating the performance of different
models, four measures including Sensitivity, Specificity, Accuracy
and F1 score were calculated based on the confusion matrix output
by each classifier using the total dataset. Sensitivity (the true positive
rate) measures the proportion of positives which are correctly
identified, and Specificity (the true negative rate) measures the
proportion of negatives which are correctly identified. Accuracy
measures the overall proportion of correct predictions, and the F1
score is the harmonic mean of precision and Sensitivity. All
four statistics reach their best value at 1 and their worst at 0. We
assessed the four statistics, and determined a relatively optimal
classifier with the highest Sensitivity, Specificity, Accuracy and F1
score.

In this study, 8 genes were identified as associated with
multiple sclerosis. We built an SVM as the best model for sample
prediction, with a predictive accuracy of around 86%. The
SVM outperformed the other models as assessed by Sensitivity,
Specificity, F1 score and Accuracy. The KEGG enrichment
analysis suggested that the genes selected were statistically related
to pathways involving apoptosis and cytokine-cytokine receptor
interaction. Among the 8 genes, TNFSF10 had a close relationship
with multiple sclerosis. Gene Ontology enrichment analysis
revealed that TNFSF10 is involved in biological processes
including protein kinase cascades, regulation of signal transduction
and apoptosis, and that GPS1 and TRPS1 were primarily enriched
in multiple sclerosis.

Apoptosis is a common regulatory mechanism for maintaining 
normal development and homeostasis of the immune system. 
Because the process of eliminating auto-reactive T cells via 
apoptosis is impaired in multiple sclerosis, apoptosis signaling- 
related genes may be strong candidate genes for involvement in 
multiple sclerosis [23]. According to the GeneCards database,
there were six published studies [24-29] referring to the
relationship between TNFSF10 and multiple sclerosis, indicating
that TNFSF10 might have an important role in multiple sclerosis.
Increased expression of TNFSF10 was observed in peripheral
blood mononuclear cells of patients with multiple sclerosis.

Table 3. Evaluation of multiple classification models including Support Vector Machine (SVM), Random Forest (RF), naive Bayes
(Bayes), Neural Network (NNT), K-Nearest Neighbor (KNN) and Logistic regression models via 10-fold cross-validation (10FCV).

Evaluation method  Model        Sensitivity  Specificity  F1 score  Accuracy
10FCV              SVM          0.9231       0.7778       0.8889    0.8636
                   RF           0.8462       0.7222       0.8302    0.7955
                   naive Bayes  0.6923       0.8889       0.7826    0.7727
                   NNT          0.8846       0.7222       0.8519    0.8182
                   KNN          0.8462       0.7222       0.8302    0.7955
                   Logistic     0.7692       0.7778       0.8000    0.7727

doi:10.1371/journal.pone.0100052.t003

TNFSF10 belongs to the tumor necrosis factor/nerve growth
factor superfamily [30], and can induce cell death or apoptosis of
inflammatory cells. Blockade of TNFSF10 expressed in CD4+
myelin-specific T cells reduces caspase-dependent neuronal cell 
death in an experimental animal model for multiple sclerosis [31]. 
TNFSF10 is involved both in cell death and in other immunoregulatory
mechanisms. According to Kikuchi et al. [24], the presence of the
CC genotype in the coding region of TNFSF10 at position 1595 in
exon 5 is associated with a higher risk of multiple sclerosis in
Japanese patients. Also, more than 80% of the top 30 most
significant genes in multiple sclerosis were categorized as
apoptosis signaling-related genes, and among them TNFSF10
was one of the significantly up-regulated genes [25]. In addition, a
more recent candidate gene case-control study in the Spanish
population found an association of 3 SNPs in the TRAIL, TRAILR-1
and TRAILR-2 genes with susceptibility to multiple sclerosis [32].

Besides TNFSF10, the remaining 7 genes showed markedly differential
expression between multiple sclerosis patients and controls,
and appear to be functionally related to apoptosis. TRPS1 executes
multiple functions in proliferating chondrocytes and activates
proliferation in columnar cells according to the function annotations
from the GeneCards database. TRPS1 was also suggested to
be an apoptosis-associated gene that acts as a death-signaling gene
to induce the elimination of cells via apoptosis [33]. GPS1 is
known to suppress survival-associated mitogen-activated protein
kinase-mediated signal transduction [34-38]. Hspbap1 is believed
to inhibit the neuroprotective effects of heat shock protein 27, and
is found extensively in the anterior temporal neocortex of patients
with intractable epilepsy [39]. MRVI1 and SMCHD1 are
respectively linked to blood coagulation and chromosome
organization.

Several studies [8,40-42] have explored gene expression patterns
in multiple sclerosis. Brynedal et al. [4] evaluated the association
between transcripts and group specificity using t-tests to detect
differentially expressed genes, and estimated the fold change of
genes between different groups. However, these studies identified a
large number of differentially regulated transcripts between
different groups. It is therefore important to apply more effective
approaches to analyze microarray data, where there are many
thousands of features but only a few tens to hundreds of samples.
Using the existing t-test approach to detect differentially expressed
genes across so many features tends to increase the number of false
positives. Prior studies [5,6] showed that combinatorial gene
selection methods could be effectively applied to identify
disease-related genes. Inspired by this idea, this work proposed a
combinational strategy to predict multiple sclerosis samples using
microarray data. Gene Ontology analysis in this study showed that
the MAPK and protein kinase cascade signaling pathways were
enriched in patients with multiple sclerosis, which was consistent
with the results from Brynedal et al. [4].

This work performed a combined approach integrating feature 
ranking algorithms and an SVM classification model for gene 
selection. We can estimate the discriminative ability of each gene 
using the proposed approach, allowing an objective and quantitative
evaluation of each gene. Because few
gene expression profile datasets of multiple sclerosis are
available at present, other independent datasets will be necessary for
an appropriate validation of the algorithm in the future.

Supporting Information 

File S1. Contains Tables S1-S3. Table S1. Gene
Ontology analysis of the selected genes using GATHER
(http://gather.genome.duke.edu/). Table S2. The strength of
association between genes and disease indicated as the
counts of publications retrieved from GeneCards (until
September 1, 2012). A larger number of related studies
retrieved by GeneCards supports a stronger association
between genes and potential diseases. Table S3. R code of
feature selection algorithms and a robust SVM classification
model. Feature selection algorithms (SVM-RFE, ROC
and Boruta) and classification models (SVM, Random Forests,
naive Bayes, Artificial Neural Network, Logistic Regression and
k-Nearest Neighbor) were built within R software. The symbol
"#" denotes program annotations.
(DOC)